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We propose a metric for the space of multiple sequence alignments that can be used to compare 
two alignments to each other. In the case where one of the alignments is a reference alignment, 
the resulting accuracy measure improves upon previous approaches, and provides a balanced as- 
sessment of the fidelity of both matches and gaps. Furthermore, in the case where a reference 
alignment is not available, we provide empirical evidence that the distance from an alignment 
produced by one program to predicted alignments from other programs can be used as a control 
for multiple alignment experiments. In particular, we show that low accuracy alignments can be 
effectively identified and discarded. 

We also show that in the case of pairwise sequence alignment, it is possible to find an alignment 
that maximizes the expected value of our accuracy measure. Unlike previous approaches based on 
expected accuracy alignment that tend to maximize sensitivity at the expense of specificity, our 
method is able to identify unalignable sequence, thereby increasing overall accuracy. In addition, 
the algorithm allows for control of the sensitivity/specificity tradeoff via the adjustment of a single 
parameter. These results are confirmed with simulation studies that show that unalignable regions 
can be distinguished from homologous, conserved sequences. 

Finally, we propose an extension of the pairwise alignment method to multiple alignment. Our 
method, which we call AMAP, outperforms existing protein sequence multiple alignment programs 
on benchmark datasets. A webserver and software downloads are available at 
http : //bio . math . berkeley . edu/amap/ 



1 Introduction 

A recent survey on sequence alignment [2| dis- 
cusses a number of important problems and 
challenges that need to be overcome in order 
to facilitate large-scale comparative analysis of 
the multiple genomes currently being sequenced. 
Among these, the following two problems are 
highlighted: 

1. "As suggested methods to evaluate 

alignment accuracy. This goes at the core 
of the problem: which regions are alignable, 



"Corresponding author 



and what is a correct alignment?" 

2. "A definition of alignability - at what point 
is it no longer possible to do meaningful se- 
quence alignment. Or rather, at what point 
can one conclude that two sequences are no 
longer related?" 

Furthermore, the development of "rigorous 
methods for evaluating the accuracy of an align- 
ment" and the need for "improved pairwise align- 
ment with a statistical basis" are singled out as 
the most pressing challenges for the alignment 
community. 
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Two commonly used alignment accuracy mea- 
sures are the developer (/d) and the modeler 
(/m) measures \Fo\. These measures correspond 
to evaluating the sensitivity (number of correctly 
matched pairs divided by the number of matched 
pairs in the reference alignment) and specificity 
(number of correctly matched pairs divided by 
the number of matched pairs in the predicted 
alignment) of matched pairs in the predicted 
alignment respectively. Both of these measures 
have a problem, in that they do not account for 
gap columns, j^j defined the agreement score, 
as the fraction of pairwise columns in the pre- 
dicted alignment that agree with the reference 
alignment. While the agreement score does con- 
sider gap columns, it is not symmetric, since the 
number of columns in the predicted alignment 
can differ from the number of columns in the 
reference alignment. 

What properties should an alignment accuracy 
measure satisfy? If we think of accuracy measure 
as a "distance" , then if h % and h? are two align- 
ments and we denote their distance by d{h l , h J ), 
we would like to have: 

d{h\h?) > 0, 

= if, and only if, K ~ h j , (1) 

d(h\h j ) = dQjrf), (2) 

d{h\h k ) < d(h\h j )+d(h j ,h k ). (3) 

The first condition specifies that the distance 
between two alignments should be non-negative 
and should be if, and only if, the two align- 
ments are equivalent (a useful definition of equiv- 
alence is given in section |2J). The second require- 
ment specifies that the distance should be sym- 
metric. For example, comparing a prediction 
with a reference alignment should be the same 
as comparing the reference alignment to the pre- 
diction. The third requirement ensures a certain 
consistency: the distance between two predic- 
tions should be less than the sum of the distances 
from the predictions to a reference alignment 
(the triangle inequality). In other words, an ac- 
curacy measure should be based on a metric. 
Furthermore, the accuracy measure should ac- 
count for unalignable sequence. For example, if 
two sequences are unrelated, the true alignment 



contains only gaps (regardless of order), and a 
good accuracy measure should reflect that. Note 
that although metrics on the space of sequences 
have been constructed [TJi|, a metric for align- 
ment accuracy should be defined on the space 
of sequence alignments, and should measure the 
distance between alignments not sequences. 

In section |2] we define an alignment metric, 
and a new accuracy measure, called AMA, based 
on this metric. Next, we propose an algorithm 
which maximizes the expected AMA value given 
an underlying probabilistic model for mutation 
of sequences. We term this alignment strategy 
AMAP. The algorithm is explained in detail in 
section |31 where we show that a single parame- 
ter we term gap-factor (Gf) can be used to ad- 
just the sensitivity/specificity tradeoff. A spe- 
cial case of AMAP, where we set Gt = 0, is the 
Maximum Expected Accuracy (MEA) alignment 
algorithm introduced in [SI IHJ and used in Prob- 
Cons U3, and Pecan [T3] . 

In section |I] we show how existing algorithms 
perform when judged using AMA, and contrast 
this with the developer score, which is the stan- 
dard measure used in most papers H7j . 
Since the developer score is a measurement of 
sensitivity of aligned pairs, and algorithms have 
traditionally been judged by it, we find that ex- 
isting algorithms are heavily biased in favor of 
sensitive, rather than specific alignments. An ex- 
treme case of this can be seen in Table where 
we show that existing algorithms align large frac- 
tions of completely unrelated sequences. We also 
see that multiple alignments produced by differ- 
ent programs differ considerably from each other, 
even though they may appear to perform simi- 
larly when judged only by the developer score. 

In a different application of the alignment met- 
ric, we show that in the typical case where ref- 
erence alignments are not available for judging 
the success of multiple alignment experiments, 
the metric can be used as a control by measur- 
ing the distances between alignments predicted 
by different programs. 

Finally, we analyze the performance of the new 
AMAP algorithm, using the SABmark dataset 
|19j and simulated datasets, and show that 
AMAP compares favorably to other programs. 
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2 Metric Based Alignment 
Accuracy 

Following the notation in |12j . an alignment 
of a pair of sequences a 1 = a\a\ - ■ ■ a\ and 
a 2 = o\a\ ■ ■ ■ can be represented by an edit 
string h over the edit alphabet {H,I,D}, where 
H stands for homology, / for insertion, and D 
for deletion. Equivalently, if A nt m is the set of 
all alignments of a sequence of length n to a se- 
quence of length m, and h G A n , m , then h can be 
represented by a path in the alignment graph, or 
a sequence of pairs of the form (ajoa 2 ), (<t 1 o— ), 
or (o~ 2 o — ) where the symbol o is used to indicate 
the alignment of two characters. For h G -4 n ,m 
let 

• h H = {(i,j) : (ajoa]) G h}, 

• h D = {i : (aj o -) G h}, 

• h! = {j : (a 2 o — ) G h}. 



Less formally, hfj is the set of position pairs in 
a 1 and a 2 that are aligned according to h, ho 
is the set of position in a 1 that are gapped, and 
hn is the set of position in a 2 that are gapped. 
Note that for any h G -4. n ,m 



\hn\ + |^d| = ti and + |/t/| 



m. 



(4) 



Two alignments are equivalent if they align the 
same character pairs, while the order of inser- 
tions and deletions between any two consecutive 
aligned pairs is redundant. We therefore define: 



h % ~ h j if and only if h l H = h j H Vh l , h j G A n m - 

.'(5) 

Note that h l H = h ] H if and only if h l j = h 3 j and 
hjj = h? D . We can therefore use the following 
equivalent definition: 

h l ~ h? if and only if h\ = h\ and h' l D = h j D 

W,h>€An,m. (6) 

We say that two alignments are distinct if they 
are not equivalent. The number of distinct align- 
ments is in bijection with lattice paths in the 
square grid (proof omitted): 



Proposition 1. The number of distinct align- 
ments in A ntm is ( n + m ) • 

Given a predicted alignment h p and a refer- 
ence alignment h r , the fr> and /m measures are 
defined: 



f(h\hi) = 


K\ 


(7) 


f{h r ,hP) = 


\h r H nh p H \ 


(8) 


f(h p ,h r ) = 


\h r H nh p H \ 
K\ 


0) 



Note that both measures do not explicitly use 
the I and D characters in W and h p , and are not 
well defined when h r or h p do not include any H 
characters. 

A distance function between any two align- 
ments h l , hi G A n ,m is needed in order to eval- 
uate the quality of a predicted alignment given 
a reference alignment. Such a distance function 
should satisfy: 



d{h\h j ) > 

d{h\h j ) = iff tir^h? 

d{h i ,h j ) = d(h j ,h i ) 



(10) 

V/l , hp G ^ri,m; 

(ii) 

V/i , hp G >A.ri,mi 

(12) 

d{h\h j )+d{h j ,h k ) >d(h\h k ) W,h j ,h k G A nn 

(13) ' 

While these requirements are not satisfied by , 
they are satisfied by the following: 



d(h\h j ] 



2|/4| + \h\\ + |/4|-2|/4n^ 



H 



\h\ n h\\ - \K- D n h 3 D \ 



= 2\h j H \ + \h{\ + \h j D \-2\h i H nh j H \ 

-\h\ n h 3 j\ - |/i J D n h 3 D \ 

= n + m - 2\h l H n h 3 H \ 

-\h\nhPj\ - l^n^l). (14) 

Proposition 2. d(/i J ,/i- ? ) is a finite metric for 

A n .m • 
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Proof: It is easy to see that d(h l , h 3 ) satisfies 
requirements (fTUl) . (fTTl) . and ((T2l . We need to 
show that it satisfies the triangle inequality (|13|) . 
Let Uij = 2\h l H nh j H \ + \h} n + |/t^ n/^J, and 

= 2|/i^ n ^ n /&| + |/4n/i}n/$| + |/i 4 D n 

P /ip | . Using the fact that — Uijk > and 
Uij + Ujk — Uijk < n + m, we have that , /t J ) + 



d(/^',/i fc ) - d{h\h k ) 
= n + m- {Uij + f/jfc 



n + m — Uij 
- U^k) + Uik 



Ujk + t/jfc 
f/iifc > o. 



Example 3 (Metric for ^2,2) • By Proposition 
there are six distinct alignments in A%,i- The 
metric is: 





HH HDI DIH IHD DHI DDII 


HH 





2 


2 


4 


4 4 


HDI 


2 





4 


3 


3 2 


DIH 


2 


4 





3 


3 2 


IHD 


4 


3 


3 





4 2 


DHI 


4 


3 


3 


4 


2 


DDII 


4 


2 


2 


2 


2 



Intuitively, the distance between two align- 
ments is the total number of characters from 
both sequences that are aligned differently in 
the two alignments. Alternatively, the quantity 
g(h l ,hi) = 1 — d ^+m - is a convenient similarity 
measure that can be interpreted as the fraction 
of characters that are aligned the same in both 
alignments. We therefore define the Alignment 
Metric Accuracy (AM A) of a predicted align- 
ment h p given a reference alignment h r to be 
g(h p , h r ). The intuitive motivation for this accu- 
racy measure is that it represents the fraction 
of characters in a 1 and a 2 that are correctly 
aligned, either to another character or to a gap. 

AMA can easily be extended to multiple se- 
quence alignments (MSA) by using the sum-of- 
pairs approach. Let Aii,n 2 ,— ,n fc De the space of 
all MS As of k sequences of lengths n\ to n&. 
Given two MSAs h l , h? G A ni ,n 2 ,...,n k , 



fc-1 



(15) 



=1 s 2 >s 



all-gap columns removed. The similarity of two 
MSAa is defined to be 



g(h*,h r 



1 



d(h p , h r ) 



(16) 



71: 



Unlike standard sum-of-pairs scoring, our defini- 
tion follows from the requirement that our accu- 
racy measure should be based on a metric, and 
the multiple AMA retains the desirable proper- 
ties of the pairwise AMA. 
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AMA Based Alignments 

expected 



Maximum 
alignments 



accuracy 



Given a probabilistic model for alignments, such 
as a pair-HMM, an alignment of a pair of se- 
quences is typically obtained by the Viterbi al- 
gorithm j^U], which finds the global alignment 
with highest probability. In the case of a pair- 
HMM with three states, the Viterbi algorithm is 
equivalent to the standard Needleman-Wunsch 
algorithm with affine gap scores 0. In effect, 
the Viterbi algorithm maximizes the expected 
number of times that a predicted alignment is 
identical to the reference alignment {h p = h r ). 
However, when the probability of the most likely 
alignment is low it might be more desirable to 
predict alignments that are likely to align the 
most number of characters correctly on average 
even if they are less likely to be identical to the 
correct alignment. 

An alternative to Viterbi alignment is the op- 
timal accuracy alignment also called max- 
imum expected accuracy (MEA) alignment 0, 
which maximizes the expected jo score. The 
MEA alignment is calculated using a dynamic 
programming algorithm that finds the alignment 
h p that maximizes the expected number of cor- 
rectly aligned character pairs: 



h p 



argmax 



E 



P(a}Oa]\a\a 2 ,e), (17) 



where P(ajOo-j\a , a 2 , 9) is the posterior proba- 
where h\ x ^ is the pairwise alignment of se- bility that a} is homologous to a 2 given a 1 , a 2 
quences s , s 2 as projected from the MSA h % with and the parameters of the model 9. In the case 
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of a pair-HMM, these posterior probabilities can 
be computed in 0{nm) time using the Forward- 
Backward algorithm 

3.2 The AMAP algorithm 

While the MEA algorithm maximizes the ex- 
pected fo score it can perform very poorly on 
the /m score when the reference alignment con- 
tains many unaligned characters (gaps), since it 
tends to over-align characters. Maximizing the 
expected fu score can be done easily by only 
aligning the pair of characters with highest pos- 
terior probability to be homologous. This will 
result in an alignment with only one H charac- 
ter, n — 1 D characters, and m — 1 I charac- 
ters, which in most cases will result in a poor 
fo score. There is currently no alignment al- 
gorithm that allows for the adjustment of the 
sensitivity/specificity tradeoff (fo / fivi trade- 
off). However, we show that it is possible to 
maximize the expected AMA value using an al- 
gorithm similar to the original MEA algorithm. 
By maximizing the expected AMA, we avoid the 
problems of MEA alignment: in contrast to the / 
function, the g function is symmetric, and there- 
fore the sensitivity (g(h r ,h p )) equals the speci- 
ficity (g(h p , h r )). In addition to maximizing the 
expected AMA value, the new algorithm, which 
we call AMAP allows for the modification of one 
parameter, which we term gap-factor,oi Gf, to 
tune the fo / /m tradeoff. 

Let P(ajO — jcr 1 , o~ , $) be the posterior proba- 
bility that a} is not homologous to any character 
in cr 2 , and P(cr|0 — {a 1 , a 2 , 6) the posterior prob- 
ability that cr 2 is not homologous to any charac- 
ter in cr 1 . AMAP should find the alignment h p 
that maximizes the expected number of charac- 
ters that are correctly aligned to another char- 
acter or to a gap: 

h v = argmax ^ P (cr 1 Oa) \ a 1 , cr 2 , 9) + 

heAn ' m (i,j)eh p H 

G f £p(a*O-|*V,0)+ 
G/^P^O-Ict 1 ,^,^). 



Note that when Gf = 0.5 the algorithm max- 
imizes the expected AMA value, while when 
Gf = the expression in (|18jl is equal to the 
expression in (|17|) . and the algorithm is identical 
to the original MEA algorithm, which maximized 
the expected fr> score. Setting Gf to higher val- 
ues than 0.5 results in better /m scores in the ex- 
pense of lower fD scores. In effect, the gap- factor 
parameter allows for the tuning of the Jd/Im 
tradeoff. 

An MEA subroutine can be used to construct 
multiple alignments using a number of different 
strategies, one of which is progressive alignment. 
Expected AMA maximization was performed in 
this context by using the ProbCons platform (the 
code is available in open source under the Gnu 
public license) with the MEA algorithm mod- 
ified so that the developer score is no longer 
maximized. Our resulting AMAP algorithm also 
omits the heuristic consistency transformation of 
ProbCons, and simply aligns pairwise alignments 
along a guide tree. 

4 Results 

4.1 Performance of existing programs 
on the SABmark datasets 

We began by assessing the performance of exist- 
ing programs on the SABmark 1.65 datasets 
with the goal of comparing alignment metric ac- 
curacy with previously used measures. SAB- 
mark includes two sets of pairwise reference 
alignments with known structure from the AS- 
TRAL 0] database. The Twilight Zone set con- 
tains 1740 sequences with less than 25% iden- 
tity divided into 209 groups based on SCOP 
folds ^Uj. The Superfamilies set contains 3280 
sequences with less than 50% identity divided 
into 425 groups. Additionally, each dataset has 
a "false positives" version, which contains unre- 
lated sequences with the same degree of sequence 
similarity in addition to the related sequences. 

Table ^ shows the performance of a number 
of existing alignment programs as measured by 
the developer, modeler, and AMA accuracy mea- 
sures on the four SABmark 1.65 datasets. Meth- 
ods tested include Align-m 2.3 19 , CLUSTALW 
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Twilight 




Superfamilics 


Twilight-FP 


Superfamilies-FP 


Program 


Id 


hi 


AM A 


Id 


Im 


AM A 


Id 


Im 


AM A 


Id 


Im 


AM A 


Align-m 


21.6 


23.6 


51.7 


49.2 


45.6 


56.9 


17.8 


6.4 


81.5 


44.8 


16.8 


77.5 


CLUSTALW 


25.6 


14.7 


24.9 


54.0 


38.1 


43.8 


20.4 


2.4 


35.5 


50.9 


7.4 


37.0 


MUSCLE 


27.3 


16.4 


27.6 


56.3 


40.3 


46.4 


19.4 


2.3 


37.1 


49.7 


7.5 


38.9 


ProbCons 


32.1 


21.1 


37.3 


59.8 


44.4 


51.8 


26.7 


4.4 


55.7 


56.0 


10.9 


55.0 


T-Coffee 


29.4 


19.6 


35.6 


58.4 


43.7 


50.9 


26.5 


4.2 


54.1 


57.0 


11.0 


54.4 



Table 1: Performance of aligners on the SABmark benchmark datasets. Entries show the 
average developer (Id), modeler (Im) and alignment metric accuracy (AM A). Best results are 
shown in bold. All numbers have been multiplied by 100. 



1.83 [H], MUSCLE 3.52 Q, ProbCons 1.1 
and T-Coffee 2.49 11. The results highlight the 
inherent sensitivity /specificity tradeoff. While 
ProbCons and T-Coffee have the best developer 
scores, Align-m has the best modeler scores. It is 
not clear which program outperforms the others. 
Programs with higher sensitivity tend to over- 
align unalignable regions, which results in lower 
specificity. We would like to answer the ques- 
tion, which program produces alignments that 
are the closest to the reference alignments? This 
is exactly the interpretation of the new AMA 
measure. Using this measure it is clear that 
Align-m is the most accurate alignment program 
among the ones tested on the SABmark bench- 
mark datasets. 1 

4.2 Controls for multiple alignment 
experiments 

The reference alignments in SABmark are them- 
selves somewhat subjective, as they are based 
on structural alignment programs. A recurring 
question has been how to judge the accuracy 
of alignment in the absence of a reference. To 
demonstrate how AMA is useful for that, we 
compared the total distance and average simi- 
larity (g) between the alignments produced by 
the five alignment programs, two variants of the 
AMAP algorithm, and the reference alignments. 
Table 12 shows these values averaged over the en- 
tire SABmark dataset. An interesting observa- 
tion is that there is a correlation between the 

1 Note that the SABmark benchmark dataset was com- 
piled by the same authors as Align-m. 



distances of the alignments of any given pro- 
gram to the reference alignment, and their dis- 
tances to the closest alignments of other pro- 
grams. For example, CLUSTALW alignments 
are 37% similar on average to the reference SAB- 
mark alignments, and 39.5% similar to AMAP 
alignments, while Align-m alignments are 67% 
similar to the reference alignments, and 68.7% 
similar to AMAP-4 alignments. 

We propose to use the similarity of predicted 
alignments from different alignment program as 
a control before using a predicted alignment. 
Figure ^ shows the correlation between the ac- 
curacy (AMA) of of CLUSTALW alignments of 
the Twilight-FP and Superfamilies-FP datasets, 
and their maximum similarity to any of the align- 
ments produced by the other four programs. It is 
evident that there is a strong correlation between 
the two values. We observed similar correlation 
for the other alignment programs (see supple- 
mentary data). In the case of the two datasets 
with no false positives (Twilight and Superfami- 
lies) the maximum similarity behaves more as an 
upper bound rather than a strong predictor for 
the AMA value (data not shown). This is still 
useful for discarding alignments that are not sim- 
ilar to other predicted alignments, since they are 
very unlikely to be similar to the true alignment. 

To summarize, the following protocol can be 
used to identify bad alignments in the absence of 
a reference alignment: 

1. Align the target sequences with a preferred 
alignment program. 

2. Align the target sequences with all other 
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Align-m 


/"IT T TO A T TIT 

CLUSTALW 


MUSCLE 


ProbCons 


T-Coffce 


A TV T A Tt 

AMAP 


A H T A Tt A 

AMAP-4 


Referen 


Align-m 




37.3 


39.9 


52.8 


50.4 


64.7 


68.7 


67.0 


CLUSTALW 


45.0 




38.2 


38.1 


39.1 


39.5 


35.9 


37.0 


MUSCLE 


43.8 


49.4 




43.2 


42.3 


43.9 


38.9 


39.2 


ProbCons 


32.0 


48.7 


47.1 




52.0 


67.2 


53.4 


51.1 


T-Coffee 


30.1 


47.0 


45.9 


38.7 




56.6 


50.9 


50.1 


AMAP 


15.5 


45.1 


43.7 


29.8 


29.4 




72.9 


64.4 


AMAP-4 


13.3 


45.6 


44.3 


32.1 


29.9 


11.6 




70.2 


Reference 


13.7 


44.1 


43.0 


31.1 


28.6 


13.8 


11.4 





Table 2: Total distance and average similarity (g) of different aligners on the SABmark 
dataset. Values below the diagonal show the total distance between alignments produced by 
different alignment programs. Values above the diagonal show the average similarity (g) between 
the different alignments. Distance values have been divided by one million, and similarity values 
have been multiplied by 100. AMAP-4 is the AMAP algorithm with gap-factor of 4. 



available alignment programs. 

3. Measure the similarity of the first alignment 
to every other alignment, 

4. If the similarity of the closest alignment to 
the first alignment is below a certain thresh- 
old, discard the alignment. 

The above procedure has no mathematical guar- 
antees, but our empirical results show that most 
of the discarded alignments will have an AMA 
value less than the selected threshold. 

4.3 Performance of the AMAP algo- 
rithm 

Next, we investigated, using pairwise alignments, 
whether the AMAP algorithm can improve on 
the Viterbi and the Maximum Expected Accu- 
racy (MEA) alignment algorithms for maximiz- 
ing the AMA. 

We first evaluated the algorithms with the 
same default parameters used in ProbCons (5 = 
0.01993, £ = 0.79433, TT match = 0.60803, and 
emission probabilities based on the BLOSUM62 
matrix). Table shows the results of the Viterbi 
algorithm and the AMAP algorithm with differ- 
ent gap-factor values on the SABmark Twilight 
Zone set, and table |1] show the results on the 
SABmark Superfamilies set. 

The results on both sets show the expected 
correlation between the gap-factor value and the 
Jm score, and the negative correlation between 




0.2 0.4 0.6 0.B 1.0 



Figure 1: Correlation between the AMA 
of CLUSTALW (as judged by reference 
alignments in SABmark), and distance 
to the nearest alignment produced by 
another program. Each dot in the plot 
corresponds to one CLUSTALW alignment in 
the SABmark Twilight-FP and Superfamilies- 
FP datasets. The x coordinate represents the 
similarity (g) of the CLUSTALW alignment 
to the closest alignment produced by one of 
four other programs (Align-m, MUSCLE, Prob- 
Cons, T-Coffee). The y coordinate represents 
the Alignment Metric Accuracy (AMA) of the 
CLUSTALW alignment as judged by the refer- 
ence SABmark alignment. 



the gap-factor value and the fo score. This 
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Default transition probabilities "Correct" transition probabilities 



Algorithm 


fn 

J L) 


J M 


AMA 


Algorithm 


fn 


J M 


AMA 


\/ lfprni 

V 1 UCl Ul 


27.2 


Ifi 


28 


V lfprni 


1 8 8 


I 7 

II .u 


46 7 


Gf — o 


29 fi 


17 7 

±1.1 


29 


Gf — 


25 9 


1 7 5 

1 1 .U 


37 2 

•ft . i— 


G t — 5 

Kjf U.O 


28.1 


19.7 


37.4 


Gf = 0.5 


17.2 


34.7 


56.3 


Gt — 1 


25.8 


22.8 


45.2 


Gf — 1 


14.1 


42.2 


57.3 




22.4 


27.6 


51.5 


G f = 2 


11.3 


52.2 


57.3 


G> = 4 


18.9 


33.1 


54.8 


G f =4 


8.9 


59.3 


56.7 


Gf = 8 


15.9 


38.9 


56.4 


G f = 8 


7.0 


68.7 


56.0 


G f = 12 


14.3 


43.3 


56.7 


G f = 12 


6.1 


74.5 


55.6 


G/ = 16 


13.1 


46.1 


56.8 


G f = 16 


5.5 


77.3 


55.4 


G / = 20 


12.4 


48.0 


56.8 


G / = 20 


5.1 


80.2 


55.2 


G / = 28 


11.3 


51.5 


56.7 


G / = 28 


4.6 


83.1 


55.0 



Table 3: Performance of algorithm variants on the SABmark Twilight Zone set. Entries 
show the Jd, /m, and AMA scores of the Viterbi, and AMAP alignments with different gap- factor 
(Gf) values on the SABmark Twilight Zone set, which includes 209 alignment groups. The first 
five columns show the results using default transition probabilities, and the last five columns show 
the results using transition probabilities calculated for each group from the reference alignments. 
All scores have been averaged over groups and multiplied by 100. 



Default transition probabilities "Correct" transition probabilities 
Algorithm f D f M AMA Algorithm f D f M AMA 



Viterbi 


53.1 


38.1 


44.2 


Viterbi 


46.8 


41.3 


53.3 


G f = 


54.8 


39.3 


45.2 


G f = 


52.4 


40.1 


49.2 


G f = 0.5 


53.6 


42.0 


49.8 


Gf = 0.5 


45.4 


56.1 


60.5 


G f = l 


51.6 


46.2 


54.5 


Gf = 1 


41.8 


63.4 


61.5 


G f = 2 


48.1 


52.1 


58.2 


G f = 2 


37.9 


70.9 


61.2 


Gf = A 


44.1 


58.5 


59.9 


Gf = A 


34.1 


75.9 


60.0 


G f = 6 


41.8 


62.0 


60.2 


Gf = 6 


31.9 


78.4 


59.2 


G f = 8 


40.2 


64.3 


60.1 


G f = 8 


30.5 


79.9 


58.6 



Table 4: Performance of algorithm variants on the SABmark Superfamilies set. Entries 
show the fo, /m, and AMA scores of the Viterbi, and AMAP alignments with different gap- factor 
(Gf) values on the SABmark Superfamilies set, which includes 425 alignment groups. The first five 
columns show the results using default transition probabilities, and the last five columns show the 
results using transition probabilities calculated for each group from the reference alignments. All 
scores have been averaged over groups and multiplied by 100. 



used, the alignment accuracy is almost identical 
to that of the Viterbi algorithm. The most ac- 
curate alignments were achieved by setting the 
gap-factor to values higher than 0.5 (20 in the 
Twilight Zone set and 6 in the Superfamilies 
set). We suspected that this is due to the fact 
that the default pair-HMM parameters underes- 



validates the prediction that the gap-factor can 
be used as a tuning parameter for the sensitiv- 
ity/specificity tradeoff of matched characters. 

When the gap-factor is set to 0.5 or higher the 
alignment accuracy is significantly better than 
Viterbi alignments, when the original ME A al- 
gorithm (AMAP with gap-factor set to 0) is 
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Parameter Settings fjj /m AMA 



e 


5 


e 





0.5 


1 


Vit 





0.5 


1 


Vit 





0.5 


1 


Vit 


30 


10.0 


90 


7.1 


1.1 


0.8 


0.6 


4.0 


61.7 


73.7 


2.2 


9.5 


51.0 


50.9 


44.7 


50 


10.0 


90 


23.6 


3.9 


2.1 


12.5 


15.7 


77.0 


83.5 


10.8 


27.2 


52.2 


51.4 


30.3 


60 


10.0 


90 


33.6 


17.7 


12.2 


24.6 


23.2 


69.5 


82.9 


23.8 


34.5 


57.2 


55.9 


41.2 


80 


10.0 


90 


61.2 


53.4 


46.6 


53.6 


45.9 


72.1 


81.9 


51.7 


57.3 


70.8 


70.7 


62.4 


80 


5.0 


90 


83.2 


80.8 


77.5 


81.1 


76.3 


85.3 


89.9 


77.7 


78.8 


82.4 


82.2 


78.6 


80 


2.0 


98 


94.4 


93.5 


92.7 


92.9 


89.0 


95.0 


96.2 


93.6 


93.0 


95.4 


95.3 


94.6 


80 


0.9 


98 


97.4 


97.2 


96.6 


96.8 


95.2 


97.5 


98.2 


96.6 


96.2 


97.2 


97.2 


96.7 


30 


0.1 


98 


94.5 


94.5 


93.8 


90.1 


93.7 


93.8 


94.1 


89.3 


93.1 


93.1 


92.6 


88.5 


50 


0.1 


98 


99.4 


99.4 


99.4 


99.3 


99.4 


99.4 


99.4 


99.3 


99.2 


99.2 


99.2 


99.0 


unrelated 




100 


100 


100 


100 


0.0 


0.0 


0.0 


0.0 


72.2 


96.8 


98.4 


94.3 



Table 5: Performance of algorithm variants on simulated data. Entries show the per- 
formance of the Viterbi algorithm (Vit), and the AMAP algorithm with different settings of the 
gap-factor parameter (0, 0.5, and 1) using three accuracy measures (Jd, Jai, an d AMA). The 
first three columns show the configuration of the pair-HMM parameters e m atch (match emission 
probability), 5 (gap initiation probability) and e (gap extension probability), except for the last 
row for which random unrelated sequences have been aligned. Best results for every parameter 
configuration and measure are shown in bold. All numbers have been multiplied by 100. 



timate the probability of insertion and deletions. 
To validate that we calculated the "true" transi- 
tion probabilities for each alignment group using 
the reference alignments, and repeated the ex- 
periment. 

The performance of the algorithms using the 
"correct" transition probabilities are shown in 
the right columns of tables |1] and 01 As ex- 
pected, with the correct parameters, the accu- 
racy of the alignments achieved when the gap- 
factor is set to 0.5 are very close to the best 
(Gt = 1). Note that we did not modify the 
emission probabilities, which might be the reason 
Gf = 0.5 did not maximize the actual accuracy. 
These results show that the AMAP algorithm 
significantly outperforms the Viterbi and MEA 
algorithms (61.5 AMA compared to 53.3 and 
49.2 AMA on the Superfamilies dataset, and 57.3 
AMA compared to 46.7 and 37.2 on the Twilight 
Zone dataset). Moreover, with the adjusted pa- 
rameters, the Viterbi algorithm outperforms the 
MEA algorithm (Gf = 0) on both datasets. This 
is due to the over-alignment problem of the MEA 
algorithm, which uses the expected fjj score as 
its objective function at the expense of the /m 
and AMA scores. Note that the best AMA scores 



achieved with the default transition probabilities 
are very close to those of the correct probabili- 
ties, demonstrating that adjustment of the gap- 
factor parameter is able to compensate for bad 
estimation of the parameters of the underlying 
probabilistic model. 

In order to further analyze the performance of 
the AMAP algorithm compared to the Viterbi 
and MEA algorithms, we also conducted simula- 
tion studies. Table |S] compares the performance 
of the Viterbi, MEA, and AMAP variants on dif- 
ferent sets of simulated pairs of related and un- 
related DNA sequences. 

Data was simulated using a pair-HMM to gen- 
erate aligned pairs of nucleotide sequences. The 
pair-HMM parameters included the transition 
probabilities 5 (gap initiation) and e (gap exten- 
sion). For simplicity we fixed the initial proba- 
bility Ttmatch of starting in a Match state to be 
1 — 25. For the emission probabilities we used a 
simple model that assigns equal probability (I) 
to any nucleotide in the Insert or Delete states, 
6m | tcfa probability for a match in the Match state, 
and 1 ~ e ™ tch probability for a mismatch in the 
Match state, where e ma t c h is the probability to 
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Program Twilight Super families Twilight-FP Superfamilies-FP Overall 
Align-m 5L7 56J 8L5 TU> WLO 
ProbCons 37.3 51.8 55.7 55.0 51.1 
AMAP 46.1 56.3 75.8 75.8 64.4 

AMAP-4 52.2 57.9 84.2 84.6 70.2 

Table 6: Performance of selected programs on the SABmark benchmark datasets. En- 
tries show the AMA score for each program and data set. All numbers have been multiplied by 
100. 



emit a pair of identical characters in the Match 
state. 

For every setting of the parameters we gener- 
ated 10 reference alignments with min(n, m) = 
1000. An identical pair-HMM with the same 
parameters was then used to compare the per- 
formance of the Viterbi algorithm and MEA al- 
gorithm with gap-factor values of (0, 0.5, and 
1). We treat every set of 10 alignments as 
one big alignment, and calculate the accuracy 
(g(h p ,h r )) of the predicted alignments, the fjj, 
and /m scores. In addition to the simulated ref- 
erence alignment generated from the pair-HMM, 
we also generated 10 pairs of unrelated random 
sequences of length 1000 each with equal prob- 
ability for every character. All algorithms have 
been evaluated on the resulting reference align- 
ments, which include no H characters, using the 
probabilities 0.8, 0.1, and 0.9 for the e ma tch, 8, 
and e parameters respectively. 

The simulation results demonstrate that the 
AMAP algorithm produces alignments that are 
more accurate than the Viterbi and MEA align- 
ment algorithms on both closely related and dis- 
tant sequences. As expected the best fjj scores 
are achieved using the MEA algorithm (Gf = 0), 
the best /m scores when Gf = 1, and the best 
AMA when G f = 0.5. 

It is interesting to note that for distant se- 
quences with larger gap initiation probability 
(5), the Viterbi algorithm has better AMA score 
than the MEA algorithm (Gf = 0). This again 
emphasizes the main weakness of MEA, which 
tends to over-align unalignable regions. This 
problem is even more pronounced when align- 
ing unrelated sequences. The MEA algorithm 
performs poorly compared to the AMAP algo- 



rithm and even the Viterbi algorithm, achieving 
a mere 72.2 AMA scored compared to 96.8 and 
94.8 respectively. This is due to the fact that 
the MEA algorithm wrongly aligns 2781 charac- 
ter pairs, compared to 157, 316, and 572 in the 
case of Gf = 1, Gf = 0.5, and Viterbi alignment 
respectively (data not shown). 

Finally, we compared the performance of the 
multiple sequence alignments version of the 
AMAP algorithm compared to ProbCons and 
Align-m. Table |H1 shows the AMA scores of each 
program on the four SABmark datasets. AMAP 
and Align-m are clearly superior to ProbCons. 
While AMAP with default parameters achieves 
slightly lower AMA scores than Align-m, setting 
the gap-factor to 4 produces the most accurate 
alignments. This demonstrate the power of the 
AMAP algorithm, which can easily be tuned us- 
ing a single parameter to improve alignment ac- 
curacy even when the parameters of the under- 
lying statistical model (transition and emission 
probabilities of the Pair-HMM) do not fit the 
data very well. 

5 Discussion 

We have proposed a metric for the set of align- 
ments, and shown how it can be used both to 
judge the accuracy of alignments, and as the 
basis for an optimization criteria for alignment. 
The importance of the metric lies in the fact that 
if two alignments are far from each other, we 
can conclude that at least one of them is in- 
accurate. This is a direct consequence of the 
triangle inequality. More importantly, we show 
that when alignments made by widely used soft- 
ware programs are compared to each other they 
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are far apart, thus quantitatively confirming that 
multiple alignment is a difficult problem. Al- 
though we see that the sensitivity of many pro- 
grams is high, i.e., many of the residues that 
should be aligned together are correctly aligned, 
it is also the case that many residues are incor- 
rectly aligned. This is particularly evident in 
results from the Twilight-FP and Superfamilies- 
FP datasets that contain unrelated sequences. 
If functional inferences are to be made from se- 
quence alignments, it is therefore important to 
control for specificity, and not only sensitivity. 
Our alignment algorithm, AMAP, which maxi- 
mizes the expected AMA, outperforms existing 
programs on benchmark datasets. 

Most exiting multiple alignment benchmark 
datasets include only alignments of "core 
blocks" , and it is therefore only possible to mea- 
sure the sensitivity of matches (/b), and not 
their specificity (/m) or the AMA. However, the 
fact that it is harder to construct datasets that 
allow for measuring the latter two does not mean 
that alignment algorithms should maximize sen- 
sitivity at the expense of specificity. Our AMAP 
algorithm is the first to allows the user to tune 
the inherent sensitivity/specificity tradeoff using 
the gap-factor parameter. In many cases, such 
as when using MSA for phylogenetic tree recon- 
struction, or for identification of remote homol- 
ogy, higher specificity is preferred over higher 
sensitivity. In addition, as we have shown, tun- 
ing the gap-factor parameter can in some cases 
compensate for poor parameter estimation of 
the underlying probabilistic model (pair-HMM). 
Further work is needed to develop methods for 
automatic adjustment of this parameter for a 
given dataset when the probabilistic parameters 
do not fit the data very well, and a reference 
alignment is not available. 

In the typical case where reference align- 
ments are not available, our empirical observa- 
tion that the distances between alignments cor- 
relate strongly with the accuracy of the pro- 
grams that generated them can be used to dis- 
card inaccurate alignments. It is possible that 
more sophisticated strategies based on this prin- 
ciple could further help in quantitatively assess- 
ing alignment reliability. 



We have not discussed the relevance of our re- 
sults to DNA multiple alignments, however many 
of our ideas can be easily adapted. As with pro- 
tein multiple alignment, the focus in DNA align- 
ment has been on sensitivity rather than speci- 
ficity. For example, whole genome alignments 
are often judged by exon coverage. We have fo- 
cused on protein sequence in this initial study, 
because in certain cases reference alignments can 
be constructed based on structural alignments. 

Finally, we mention that it is possible to for- 
mulate MEA multiple alignment using our AMA 
as an integer linear program using ideas similar 
to those in |IJ[E|- In particular, it is possible to 
set up a program with a polynomial number of 
variables and constraints. It should be interest- 
ing to study the possibility of applying approxi- 
mation methods to solving such programs. 
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